vibe coded buildkite prototype #404
Conversation
Coverage report (generated by python-coverage-comment-action).
Pull request overview
This PR introduces a proof-of-concept Buildkite integration for running GPU kernel submissions on vendor-donated hardware. Buildkite provides better isolation guarantees than GitHub Actions by binding each agent to a single GPU, with CPU/RAM limits enforced via systemd cgroups.
Changes:
- Adds BuildkiteLauncher to create and monitor Buildkite builds via API (see the flow sketch after this list)
- Implements buildkite-runner.py script that executes inside Docker containers on Buildkite agents
- Provides infrastructure scripts for vendors to set up isolated GPU agents
- Includes comprehensive documentation and testing guides
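For orientation, the create-and-poll cycle the launcher implements looks roughly like the following. This is a minimal sketch against the public Buildkite REST API; the endpoint paths are real Buildkite routes, but the function name, slugs, payload fields, and polling interval are illustrative assumptions, not the PR's actual code.

```python
# Minimal sketch of the BuildkiteLauncher flow, not the PR's implementation.
# The endpoints are public Buildkite REST API routes; everything else
# (function name, payload fields, polling interval) is illustrative.
import time
import requests

API = "https://api.buildkite.com/v2"

def run_build(token: str, org: str, pipeline: str, env: dict) -> dict:
    headers = {"Authorization": f"Bearer {token}"}
    # Create a build: POST /organizations/{org}/pipelines/{pipeline}/builds
    build = requests.post(
        f"{API}/organizations/{org}/pipelines/{pipeline}/builds",
        headers=headers,
        json={"commit": "HEAD", "branch": "main",
              "message": "kernel submission", "env": env},
    ).json()
    # Poll the build's API URL until it leaves its transitional states
    while build["state"] in ("scheduled", "running"):
        time.sleep(10)
        build = requests.get(build["url"], headers=headers).json()
    # List the finished build's artifacts (result.json, profile data, ...)
    artifacts = requests.get(f"{build['url']}/artifacts", headers=headers).json()
    return {"state": build["state"], "artifacts": artifacts}
```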
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/libkernelbot/launchers/buildkite.py | New launcher that creates Buildkite builds, polls for completion, and downloads artifacts |
| src/runners/buildkite-runner.py | Runner script that decompresses payload and executes submissions in Buildkite agents |
| src/libkernelbot/consts.py | Adds BuildkiteGPU enum and BUILDKITE scheduler type |
| src/libkernelbot/launchers/__init__.py | Exports BuildkiteLauncher |
| src/kernelbot/main.py | Registers BuildkiteLauncher if API token is configured |
| src/kernelbot/env.py | Adds Buildkite environment variables (optional) |
| tests/test_buildkite.py | Unit tests for BuildkiteLauncher, GPU enums, and runner script |
| docker/kernelbot-runner/Dockerfile | Container image with CUDA, PyTorch, and kernelbot dependencies |
| docker/kernelbot-runner/requirements-runner.txt | Python dependencies for the runner image |
| .buildkite/pipeline.yml | Buildkite pipeline configuration for running submissions in Docker |
| .github/workflows/build-runner-image.yml | Workflow to automatically build and push runner image |
| scripts/buildkite/setup-agent.sh | Script for vendors to configure per-GPU agents with resource isolation |
| scripts/buildkite/update-image.sh | Script for vendors to update runner image and restart agents |
| docs/docs/vendor-onboarding/buildkite.md | Vendor onboarding guide with setup and troubleshooting instructions |
| docs/docs/vendor-onboarding/testing-guide.md | Comprehensive testing guide covering unit, container, agent, and E2E tests |
| docs/BUILDKITE_POC.md | POC documentation describing architecture, implementation, and next steps |
| src/discord_cluster_manager.egg-info/* | Auto-generated package metadata files |
```python
profile_artifact = None
for artifact in artifacts:
    if artifact.get("filename") == "result.json":
        result_artifact = artifact
    elif artifact.get("path", "").startswith("profile_data/"):
        profile_artifact = artifact
```
Copilot AI · Jan 29, 2026
The profile_artifact variable is only set when iterating over artifacts if a path starts with "profile_data/", but this will only capture the first matching profile artifact. If there are multiple profile artifacts (which is possible since the pipeline config specifies "profile_data/**/*"), only the first one will be stored and its download URL will be added to all ProfileResult objects. This could lead to incorrect profile data associations. Consider either collecting all profile artifacts or being more specific about which profile artifact to use.
Suggested change:

```python
profile_artifacts = []
for artifact in artifacts:
    if artifact.get("filename") == "result.json":
        result_artifact = artifact
    elif artifact.get("path", "").startswith("profile_data/"):
        profile_artifacts.append(artifact)

# Determine which profile artifact to use, if any
profile_artifact = None
if len(profile_artifacts) == 1:
    profile_artifact = profile_artifacts[0]
elif len(profile_artifacts) > 1:
    logger.warning(
        "Multiple profile artifacts found; not associating a specific one with runs."
    )
```
```python
class BuildkiteGPU(Enum):
    # Queue naming: {vendor}-{gpu_type}
    # Buildkite agents use tags like queue=nvidia-h100-0 for per-GPU routing
    # The enum value is the queue prefix; agents append -N for specific GPU index
    NVIDIA_H100 = "nvidia-h100"
    NVIDIA_B200 = "nvidia-b200"
    NVIDIA_A100 = "nvidia-a100"
    AMD_MI300 = "amd-mi300"
    AMD_MI250 = "amd-mi250"
    GOOGLE_TPU = "google-tpu"
    NEBIUS_H100 = "nebius-h100"
```
Copilot AI · Jan 29, 2026
The BuildkiteGPU enum includes several GPU types (NVIDIA_H100, NVIDIA_B200, NVIDIA_A100, AMD_MI300, AMD_MI250, GOOGLE_TPU, NEBIUS_H100) but the GPU_TO_SM mapping dictionary (lines 127-138 in consts.py) is missing entries for these new GPU types. When _get_arch is called with any of these new Buildkite GPU types, it will raise a KeyError. Add mappings for at least the NVIDIA GPUs: NVIDIA_H100 should map to "90a", NVIDIA_B200 to "100", NVIDIA_A100 to "80", and NEBIUS_H100 to "90a".
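A minimal sketch of the missing entries, assuming GPU_TO_SM keys on the GPU enum members and maps to SM/arch strings as the comment implies; the actual dict shape in consts.py may differ.

```python
# Sketch of the GPU_TO_SM additions named above; assumes the dict keys on
# enum members. Adjust to the actual shape in src/libkernelbot/consts.py.
GPU_TO_SM.update({
    BuildkiteGPU.NVIDIA_H100: "90a",
    BuildkiteGPU.NVIDIA_B200: "100",
    BuildkiteGPU.NVIDIA_A100: "80",
    BuildkiteGPU.NEBIUS_H100: "90a",
})
```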
`.buildkite/pipeline.yml` (outdated)
```yaml
agents:
  # Route to agent with matching queue tag
  queue: "${GPU_QUEUE}"
timeout_in_minutes: 15
```
Copilot AI · Jan 29, 2026
The pipeline timeout_in_minutes is hardcoded to 15 minutes, but the BuildkiteLauncher calculates timeouts dynamically based on config (get_timeout function) and adds a TIMEOUT_BUFFER_MINUTES. For submissions with longer configured timeouts, the Buildkite pipeline will timeout before the launcher's timeout logic can cancel the build. The pipeline timeout should either be set to a higher value or made configurable via an environment variable to align with the launcher's timeout calculation.
Suggested change:

```yaml
# Pipeline timeout is configurable to align with BuildkiteLauncher timeouts.
timeout_in_minutes: ${PIPELINE_TIMEOUT_MINUTES:-60}
```
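On the launcher side, the computed timeout could then be threaded through the build's env; a minimal sketch, reusing the get_timeout and TIMEOUT_BUFFER_MINUTES names from the comment above (the helper and env keys are illustrative):

```python
# Illustrative helper: pass the launcher's computed timeout into the build env
# so the pipeline's ${PIPELINE_TIMEOUT_MINUTES:-60} default is overridden.
def build_env(config, gpu_queue: str) -> dict:
    timeout_minutes = get_timeout(config) + TIMEOUT_BUFFER_MINUTES
    return {
        "GPU_QUEUE": gpu_queue,
        "PIPELINE_TIMEOUT_MINUTES": str(timeout_minutes),
    }
```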
`tests/test_buildkite.py` (outdated)
```python
return {
    "lang": "py",
    "mode": "test",
    "files": {"main.py": "print('hello')"},
```
Copilot AI · Jan 29, 2026
The test config uses "files" but should use "sources" to match the expected config format used by run_config. Looking at other tests in the codebase (e.g., tests/test_backend.py and tests/test_task.py), the config passed to launchers should have "sources" as the key containing file mappings, not "files". This will cause the runner script to fail when it calls run_config with this payload.
| "files": {"main.py": "print('hello')"}, | |
| "sources": {"main.py": "print('hello')"}, |
```python
self.token = token
self._headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
}
```
Copilot AI · Jan 29, 2026
The Buildkite API token is stored in plaintext in an instance variable. While this is consistent with how the GitHub launcher handles tokens, consider that tokens could be exposed in logs or error messages. Ensure that `__repr__` or `__str__` methods don't accidentally expose this token, and that error messages sanitize sensitive information.
Suggested change:

```python
# Store the sensitive token in a private attribute to reduce accidental exposure.
self._token = token
self._headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
}

@property
def token(self) -> str:
    """
    Read-only access to the Buildkite API token.

    Kept as a property so external code can still access `launcher.token`,
    while the actual storage remains in a private attribute.
    """
    return self._token

def __repr__(self) -> str:
    """
    Representation of the launcher that avoids exposing the API token.
    """
    return (
        f"{self.__class__.__name__}(org={self.org!r}, "
        f"pipeline={self.pipeline!r})"
    )
```
```python
if state != "passed":
    logger.warning(f"Build #{build_number} finished with state: {state}")
    return
```
Copilot AI · Jan 29, 2026
The _wait_for_completion method returns successfully for builds in states "failed", "canceled", or "blocked", not just "passed". This means the method will return normally even when the build didn't succeed, and then the code proceeds to download artifacts. A failed or canceled build may not have artifacts, which could lead to confusing error messages. Consider checking the state after _wait_for_completion returns and returning an appropriate error FullResult for non-passed states before attempting artifact download.
Suggested change:

```python
if state == "passed":
    return
logger.warning(f"Build #{build_number} finished with state: {state}")
raise RuntimeError(
    f"Build #{build_number} finished with non-passed state: {state}"
)
```
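If the raising variant above is adopted, the caller needs to convert the exception into an error result before the artifact download; a minimal sketch, assuming an async _wait_for_completion and a FullResult constructor with success/error fields (both assumptions, not verified against the PR):

```python
# Hypothetical caller-side handling; FullResult's signature is assumed.
try:
    await self._wait_for_completion(build_number)
except RuntimeError as exc:
    # Return an error result instead of attempting an artifact download
    # that would fail with a confusing "Could not find result.json" message.
    return FullResult(success=False, error=str(exc), runs={}, system=None)
```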
`src/runners/buildkite-runner.py` (outdated)
```python
def main():
    # Get payload from environment variable
    payload_b64 = os.environ.get("SUBMISSION_PAYLOAD")
    if not payload_b64:
        raise RuntimeError("SUBMISSION_PAYLOAD environment variable not set")

    # Decompress and parse config
    payload = zlib.decompress(base64.b64decode(payload_b64)).decode("utf-8")
    config = json.loads(payload)

    # Run the submission
    result = run_config(config)

    # Write result to file for artifact upload
    result_dict = asdict(result)
    Path("result.json").write_text(json.dumps(result_dict, default=serialize))
```
Copilot AI · Jan 29, 2026
The buildkite-runner.py main function does not wrap the entire execution in a try-except block. If run_config raises an exception (e.g., due to an invalid config or runtime error), the exception will propagate and no result.json will be created. This will cause the artifact download in BuildkiteLauncher to fail with a confusing "Could not find result.json artifact" error rather than showing the actual error. Consider wrapping the execution in try-except and writing an error result to result.json on failure.
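A minimal sketch of that suggestion, mirroring the main() shown above; the error-dict keys are assumptions and must match whatever shape the launcher expects in result.json:

```python
def main():
    payload_b64 = os.environ.get("SUBMISSION_PAYLOAD")
    if not payload_b64:
        raise RuntimeError("SUBMISSION_PAYLOAD environment variable not set")
    try:
        payload = zlib.decompress(base64.b64decode(payload_b64)).decode("utf-8")
        config = json.loads(payload)
        result_dict = asdict(run_config(config))
    except Exception as exc:
        # Record every failure so the launcher never hits a missing
        # result.json artifact. The "success"/"error" keys are illustrative.
        result_dict = {"success": False, "error": f"{type(exc).__name__}: {exc}"}
    # Always write result.json for artifact upload, even on failure.
    Path("result.json").write_text(json.dumps(result_dict, default=serialize))
```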
```python
mock_download.return_value = MagicMock(success=True)

reporter = MockProgressReporter()
result = await launcher.run_submission(mock_config, gpu_type, reporter)
```
Copilot AI · Jan 29, 2026
Variable result is not used.
Suggested change:

```python
result = await launcher.run_submission(mock_config, gpu_type, reporter)
assert result.success is True
```
don't bother reviewing!